The Exploration
More Notes and Possible Extensions
Pros of this set
Why this is a good data set and problem:
TODO
## Warning: package 'dplyr' was built under R version 3.4.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
### EXPLORATION
head(wine)
## X country
## 1 0 US
## 2 1 Spain
## 3 2 US
## 4 3 US
## 5 4 France
## 6 5 Spain
## description
## 1 This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.
## 2 Ripe aromas of fig, blackberry and cassis are softened and sweetened by a slathering of oaky chocolate and vanilla. This is full, layered, intense and cushioned on the palate, with rich flavors of chocolaty black fruits and baking spices. A toasty, everlasting finish is heady but ideally balanced. Drink through 2023.
## 3 Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious, balanced and complex botrytised white. Dark gold in color, it layers toasted hazelnut, pear compote and orange peel flavors, reveling in the succulence of its 122 g/L of residual sugar.
## 4 This spent 20 months in 30% new French oak, and incorporates fruit from Ponzi's Aurora, Abetina and Madrona vineyards, among others. Aromatic, dense and toasty, it deftly blends aromas and flavors of toast, cigar box, blackberry, black cherry, coffee and graphite. Tannins are polished to a fine sheen, and frame a finish loaded with dark chocolate and espresso. Drink now through 2032.
## 5 This is the top wine from La Bégude, named after the highest point in the vineyard at 1200 feet. It has structure, density and considerable acidity that is still calming down. With 18 months in wood, the wine has developing an extra richness and concentration. Produced by the Tari family, formerly of Château Giscours in Margaux, it is a wine made for aging. Drink from 2020.
## 6 Deep, dense and pure from the opening bell, this Toro is a winner. Aromas of dark ripe black fruits are cool and moderately oaked. This feels massive on the palate but sensationally balanced. Flavors of blackberry, coffee, mocha and toasty oak finish spicy, smooth and heady. Drink this exemplary Toro through 2023.
## designation points price province
## 1 Martha's Vineyard 96 235 California
## 2 Carodorum Selección Especial Reserva 96 110 Northern Spain
## 3 Special Selected Late Harvest 96 90 California
## 4 Reserve 96 65 Oregon
## 5 La Brûlade 95 66 Provence
## 6 Numanthia 95 73 Northern Spain
## region_1 region_2 variety
## 1 Napa Valley Napa Cabernet Sauvignon
## 2 Toro Tinta de Toro
## 3 Knights Valley Sonoma Sauvignon Blanc
## 4 Willamette Valley Willamette Valley Pinot Noir
## 5 Bandol Provence red blend
## 6 Toro Tinta de Toro
## winery
## 1 Heitz
## 2 Bodega Carmen Rodríguez
## 3 Macauley
## 4 Ponzi
## 5 Domaine de la Bégude
## 6 Numanthia
range(wine$points)
## [1] 80 100
pts = data.frame(wine$points)
ggplot(data=pts, aes(pts)) + stat_count(fill="green", colour="blue") + geom_density(color="blue")
## Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous.
### a look at high point wine descriptions
sample(wine$description[wine$points > 95], size = 10)
## [1] This is the best of the winery's new releases, both for drinking now and for cellaring. Combines power and elegance, with upscale flavors of cherries, currants, cola, rhubarbs, dried herbs and oak, wrapped into devastatingly rich tannins and a perfect touch of acidity. Should age well for 10 years.
## [2] Ripe and delicious right out of the bottle, although it will age well. Made from a vineyard on the true Sonoma Coast, in the remote Fort Ross area, it's fabulously rich in cherry, framboise and red currant flavors, with intriguing waves of anise, bacon and sweet, vanilla-scented smoky sandalwood. There's a foresty taste, too, suggesting dried pine needles. The tannins and acids are near perfect. Stunning now, and should hold for at least six years.
## [3] A cool climate has given this wine bracing acidity and gentle-yet-persistent tannins. Its flavors of wild raspberries and cherries, with spice and mineral overtones, are delicious. There's a feral quality that expresses umami perfectly, suggesting butter-sautéed, tamari-splashed mushrooms and crisped prosciutto. The oak is powerful and present in the form of toast and sweet vanilla, but perfectly balanced with the wine's volume. For all its richness, the wine finishes bone dry. Drink this impeccable Pinot Noir now through 2020.
## [4] This may be from the exceptional vintage of 2003, but Château Margaux remains true to form. First and foremost, it is a refined, elegant wine, with complex layers of flavors. But, yes, the hot summer is there the dense, dry tannins, but somehow they seem to float through the wine rather than sitting heavily in the middle. Acidity and freshness come to finish, giving the wine a delicious lift. Imported by Diageo Chateau & Estates.
## [5] A monumental Cabernet that succeeds on every level. It's deep, dark and ultrarich, showing massive blackberries, black currants, dark chocolate, violets, minerals and a touch of sweet, smoky oak. Beyond the flavors is the dramatic structure, with fabulously ripe, dense tannins and a bone-dry, long finish. Surely a tremendous wine capable of long-term aging. Try 2012–2018, at least.
## [6] Even better than the highly acclaimed 2001 vintage, this Merlot from a tiny, clay soil vineyard in Bolgheri is just about everything you've ever dreamed of tasting. Gorgeous, generous, voluptuous, cheerful, succulent and intense: Masseto is all those things. The aromas are seamless and capture the essence of chocolate fudge, sweet cherry, blackberry, spice and vanilla. It boasts thick, dense extraction, excellent structure and amazing persistence.
## [7] This is a great wine. It has all the elements in place to produce a wine that will last for years: powerful fruit, rich tannins and a structure that is built to last. This is elegant, impressive and concentrated. Almost entirely Cabernet Sauvignon, the grape gives the wine its fruit and its tannins. Drink from 2028.
## [8] The 2010 Ornellaia celebrates the wine's 25th anniversary, and it could not have been a more spectacular vintage. It opens with aromas of crushed blue flowers, black berries, pipe tobacco and thyme that give way to an elegant, structured and polished palate. It delivers intense blackberry flavors layered with white pepper, Mediterranean herbs, mineral and mocha brightened by fresh acidity alongside smooth, velvety tannins. This will age and develop for decades. Drink 2016–2040.
## [9] A stunning wine with amazing intensity and a very distinct personality, this has aromas of ripe fruit, leather, tobacco, cola, dried ginger and cigar box. The mouthfeel is very long, smooth and velvety. Age this wine 10–15 more years.
## [10] With all the extreme intensity of this vintage, this wine brings even more. The start of cookie yeast and toast character, a vibrant mineral texture, hints of spice, green herbs and a final white peach flavor. It is still young, just keep aging.
## 97821 Levels: . Big, lively and very intense, this powerful Amarone opens with a blast of blackberry, spice, vanilla, chocolate, leather and tobacco aromas. It shows sticky, extracted concentration in the mouth and ripe, succulent flavors of jammy fruit and cherry liqueur. It's a huge wine from all points of view. ...
### a look at low point wine descriptions
sample(wine$description[wine$points < 83], size = 10)
## [1] Simple but cheerful flavors of cherry and berry lead this blend from Virginia. Soft in structure and lacking some dimension, the wine is nonetheless appealing for its bright character.
## [2] Sweet and sugary, with a green, gooseberry edge to the lemon and lime flavors.
## [3] This wine offers some blackberry aromas and flavors. On the downside are herbaceous green notes, especially on the rather sharp finish. It's a thin wine.
## [4] Pours dark and heavy looking, and the aroma suggests Port, which the first sip confirms. Despite official descriptions as dry, the wine tastes sweet and hot, thanks to nearly 16 percent of alcohol.
## [5] Way too high in alcohol, and possibly residual sugar, which gives this Zin a sweet, syrupy blackberry jelly taste.
## [6] Tastes like a slightly sweet soda pop or wine cooler drink, with orange, cherry and vanilla flavors. Don't be afraid to put some sliced strawberries in there, and maybe a little umbrella.
## [7] Even in the cool Santa Monica Mountains of Malibu the heat waves of this vintage struck, and this wine suffered. It has Porty, raisiny aromas and a stewed-fruit taste, with heat on the finish.
## [8] Tastes sugared, and the fruit is thin, making the alcohol taste hot.
## [9] This is a pretty good everyday Cab if you're looking for something full bodied, dry and fruity. There's some sharpness, but the blackberry, raisin and chocolate flavors are nice.
## [10] Hot in alcohol, Porty-sweet in raspberry and cherry fruit, but soft in acidity, this is a simple wine in a good-looking bottle.
## 97821 Levels: . Big, lively and very intense, this powerful Amarone opens with a blast of blackberry, spice, vanilla, chocolate, leather and tobacco aromas. It shows sticky, extracted concentration in the mouth and ripe, succulent flavors of jammy fruit and cherry liqueur. It's a huge wine from all points of view. ...
### CLEANING
## project into (points, price, description), remove all rows with any NA
nrow(wine)
## [1] 150930
sum(complete.cases(wine)) ##looking at data, seems due to large # of NA in region columns
## [1] 137235
sum(complete.cases(data.frame(wine[, c("points","price", "description")])))
## [1] 137235
wine_projection = wine[complete.cases(wine[, c("points","price", "description")]), c("points","price", "description")]
corr(as.matrix(wine_projection[,c("price","points")]))
## [1] 0.4598634
length(levels(wine$country))
## [1] 49
length(levels(wine$variety))
## [1] 632
logit_transform = function(points){
log((points - 79.95) / (100.05 - points))
}
logit_points = logit_transform(wine$points)
range(logit_points)
## [1] -5.993961 5.993961
wine_projection = data.frame(wine_projection, logit_points = logit_transform(wine_projection$points))
head(wine_projection)
## points price
## 1 96 235
## 2 96 110
## 3 96 90
## 4 96 65
## 5 95 66
## 6 95 73
## description
## 1 This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.
## 2 Ripe aromas of fig, blackberry and cassis are softened and sweetened by a slathering of oaky chocolate and vanilla. This is full, layered, intense and cushioned on the palate, with rich flavors of chocolaty black fruits and baking spices. A toasty, everlasting finish is heady but ideally balanced. Drink through 2023.
## 3 Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious, balanced and complex botrytised white. Dark gold in color, it layers toasted hazelnut, pear compote and orange peel flavors, reveling in the succulence of its 122 g/L of residual sugar.
## 4 This spent 20 months in 30% new French oak, and incorporates fruit from Ponzi's Aurora, Abetina and Madrona vineyards, among others. Aromatic, dense and toasty, it deftly blends aromas and flavors of toast, cigar box, blackberry, black cherry, coffee and graphite. Tannins are polished to a fine sheen, and frame a finish loaded with dark chocolate and espresso. Drink now through 2032.
## 5 This is the top wine from La Bégude, named after the highest point in the vineyard at 1200 feet. It has structure, density and considerable acidity that is still calming down. With 18 months in wood, the wine has developing an extra richness and concentration. Produced by the Tari family, formerly of Château Giscours in Margaux, it is a wine made for aging. Drink from 2020.
## 6 Deep, dense and pure from the opening bell, this Toro is a winner. Aromas of dark ripe black fruits are cool and moderately oaked. This feels massive on the palate but sensationally balanced. Flavors of blackberry, coffee, mocha and toasty oak finish spicy, smooth and heady. Drink this exemplary Toro through 2023.
## logit_points
## 1 1.376992
## 2 1.376992
## 3 1.376992
## 4 1.376992
## 5 1.091990
## 6 1.091990
mod = lm(logit_points~price, data=wine_projection)
plot(mod)
mod2 = lm(logit_points~log(price)+sqrt(price), data=wine_projection)
plot(mod2)
summary(mod2)
##
## Call:
## lm(formula = logit_points ~ log(price) + sqrt(price), data = wine_projection)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2672 -0.3104 0.1085 0.4490 5.8597
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.993260 0.017622 -169.86 <2e-16 ***
## log(price) 0.815746 0.010456 78.01 <2e-16 ***
## sqrt(price) -0.034446 0.003343 -10.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7398 on 137232 degrees of freedom
## Multiple R-squared: 0.2898, Adjusted R-squared: 0.2898
## F-statistic: 2.8e+04 on 2 and 137232 DF, p-value: < 2.2e-16
ggplot(data=wine_projection, aes(logit_points)) + stat_count(fill="green", colour="blue") + geom_density(color="blue")
wine_projection = data.frame(wine_projection, inv_price = 1/wine_projection$price, price13 = wine_projection$price^(1/3) )
head(wine_projection)
## points price
## 1 96 235
## 2 96 110
## 3 96 90
## 4 96 65
## 5 95 66
## 6 95 73
## description
## 1 This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.
## 2 Ripe aromas of fig, blackberry and cassis are softened and sweetened by a slathering of oaky chocolate and vanilla. This is full, layered, intense and cushioned on the palate, with rich flavors of chocolaty black fruits and baking spices. A toasty, everlasting finish is heady but ideally balanced. Drink through 2023.
## 3 Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious, balanced and complex botrytised white. Dark gold in color, it layers toasted hazelnut, pear compote and orange peel flavors, reveling in the succulence of its 122 g/L of residual sugar.
## 4 This spent 20 months in 30% new French oak, and incorporates fruit from Ponzi's Aurora, Abetina and Madrona vineyards, among others. Aromatic, dense and toasty, it deftly blends aromas and flavors of toast, cigar box, blackberry, black cherry, coffee and graphite. Tannins are polished to a fine sheen, and frame a finish loaded with dark chocolate and espresso. Drink now through 2032.
## 5 This is the top wine from La Bégude, named after the highest point in the vineyard at 1200 feet. It has structure, density and considerable acidity that is still calming down. With 18 months in wood, the wine has developing an extra richness and concentration. Produced by the Tari family, formerly of Château Giscours in Margaux, it is a wine made for aging. Drink from 2020.
## 6 Deep, dense and pure from the opening bell, this Toro is a winner. Aromas of dark ripe black fruits are cool and moderately oaked. This feels massive on the palate but sensationally balanced. Flavors of blackberry, coffee, mocha and toasty oak finish spicy, smooth and heady. Drink this exemplary Toro through 2023.
## logit_points inv_price price13
## 1 1.376992 0.004255319 6.171006
## 2 1.376992 0.009090909 4.791420
## 3 1.376992 0.011111111 4.481405
## 4 1.376992 0.015384615 4.020726
## 5 1.091990 0.015151515 4.041240
## 6 1.091990 0.013698630 4.179339
grid = regsubsets(logit_points~.+log(price)+sqrt(price)-points-description, data=wine_projection, nvmax=13)
summary(grid)
## Subset selection object
## Call: regsubsets.formula(logit_points ~ . + log(price) + sqrt(price) -
## points - description, data = wine_projection, nvmax = 13)
## 5 Variables (and intercept)
## Forced in Forced out
## price FALSE FALSE
## inv_price FALSE FALSE
## price13 FALSE FALSE
## log(price) FALSE FALSE
## sqrt(price) FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
## price inv_price price13 log(price) sqrt(price)
## 1 ( 1 ) " " " " " " "*" " "
## 2 ( 1 ) " " "*" " " "*" " "
## 3 ( 1 ) " " " " "*" "*" "*"
## 4 ( 1 ) "*" "*" "*" " " "*"
## 5 ( 1 ) "*" "*" "*" "*" "*"
grid$rss
## [1] 105760.56 76693.90 75228.79 75091.92 75090.98 75088.11
(train_rmse_from_grid = sqrt(grid$rss/nrow(wine_projection)))
## [1] 0.8778684 0.7475633 0.7403884 0.7397145 0.7397099 0.7396958
summary(grid)$cp
## [1] 138.836337 9.932217 8.177749 4.281646 6.000000
which.min(summary(grid)$cp)
## [1] 4
summary(grid)$bic
## [1] -46840.67 -46959.68 -46951.61 -46945.67 -46934.13
which.min(summary(grid)$bic)
## [1] 2
Result of regsubsets, additive, no interactions
mod3 = glm(logit_points~log(price)+inv_price, data=wine_projection)
plot(mod3)
summary(mod3)
##
## Call:
## glm(formula = logit_points ~ log(price) + inv_price, data = wine_projection)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.2524 -0.3189 0.1163 0.4524 5.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.484857 0.032905 -75.52 <2e-16 ***
## log(price) 0.631856 0.007673 82.35 <2e-16 ***
## inv_price -2.044902 0.178734 -11.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.5472014)
##
## Null deviance: 105761 on 137234 degrees of freedom
## Residual deviance: 75094 on 137232 degrees of freedom
## AIC: 306717
##
## Number of Fisher Scoring iterations: 2
ggplot(wine_projection, aes(x=wine_projection$inv_price, y=wine_projection$logit_points)) + ggtitle("Logit Points versus inverse price") + theme(plot.title = element_text(hjust = 0.5)) + labs(x="inverse price",y="logit points") + geom_point(shape=1, color="blue") + geom_smooth(method=lm, color="green")
ggplot(wine_projection, aes(x=log(wine_projection$price), y=wine_projection$logit_points)) + ggtitle("Logit Points versus log price") + theme(plot.title = element_text(hjust = 0.5)) + labs(x="log price",y="logit points") + geom_point(shape=1, color="blue") + geom_smooth(method=lm, color="green")
mod4 = glm(logit_points~log(price)+inv_price + price13 + price, data=wine_projection)
plot(mod4)
summary(mod4)
##
## Call:
## glm(formula = logit_points ~ log(price) + inv_price + price13 +
## price, data = wine_projection)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.2537 -0.3168 0.1144 0.4514 5.8691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.7340007 0.1069852 -25.555 <2e-16 ***
## log(price) 0.8651845 0.1008640 8.578 <2e-16 ***
## inv_price -0.8139193 0.5374550 -1.514 0.1299
## price13 -0.1934940 0.0870629 -2.222 0.0263 *
## price 0.0005801 0.0003335 1.739 0.0820 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.5471855)
##
## Null deviance: 105761 on 137234 degrees of freedom
## Residual deviance: 75090 on 137230 degrees of freedom
## AIC: 306715
##
## Number of Fisher Scoring iterations: 2
## SCREE plot to determine optimal number of price clusters
dat = data.frame(wine_projection$price)
wss = (nrow(dat)-1)*sum(apply(dat,2,var))
for (i in 2:15) {
wss[i] = sum(kmeans(dat,centers=i)$withinss)
}
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
km.out = kmeans(wine_projection$price, 3)
plot(wine_projection$price, col=(km.out$cluster + 1))
ggplot(data=wine_projection[km.out$cluster==1,], aes(price)) + stat_count(fill="green", colour="blue") + geom_density(color="blue")
ggplot(data=wine_projection[km.out$cluster==2,], aes(price)) + stat_count(fill="green", colour="blue") + geom_density(color="blue")
ggplot(data=wine_projection[km.out$cluster==3,], aes(price)) + stat_count(fill="green", colour="blue") + geom_density(color="blue")
pr_cluster = as.factor(km.out$cluster)
levels(pr_cluster)
## [1] "1" "2" "3"
levels(pr_cluster) = c("med","high","low")
levels(pr_cluster)
## [1] "med" "high" "low"
wine_projection = data.frame(wine_projection, p_cluster=pr_cluster)
mod5 = glm(logit_points~log(price)*p_cluster, data=wine_projection)
plot(mod5)
summary(mod5)
##
## Call:
## glm(formula = logit_points ~ log(price) * p_cluster, data = wine_projection)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.2556 -0.3144 0.0986 0.4394 5.8589
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.40891 0.05813 -41.441 <2e-16 ***
## log(price) 0.60943 0.01393 43.747 <2e-16 ***
## p_clusterhigh -0.49868 0.06010 -8.298 <2e-16 ***
## p_clusterlow 0.07999 0.36713 0.218 0.828
## log(price):p_clusterhigh 0.12448 0.01482 8.400 <2e-16 ***
## log(price):p_clusterlow -0.01112 0.06529 -0.170 0.865
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.5473072)
##
## Null deviance: 105761 on 137234 degrees of freedom
## Residual deviance: 75106 on 137229 degrees of freedom
## AIC: 306746
##
## Number of Fisher Scoring iterations: 2
mod6 = glm(logit_points~inv_price*p_cluster, data=wine_projection)
plot(mod6)
summary(mod6)
##
## Call:
## glm(formula = logit_points ~ inv_price * p_cluster, data = wine_projection)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.2861 -0.3114 0.0969 0.4421 5.8267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.83348 0.01691 49.295 < 2e-16 ***
## inv_price -43.30427 0.99930 -43.335 < 2e-16 ***
## p_clusterhigh -0.86770 0.01772 -48.974 < 2e-16 ***
## p_clusterlow 0.92024 0.08618 10.678 < 2e-16 ***
## inv_price:p_clusterhigh 31.09292 1.00300 31.000 < 2e-16 ***
## inv_price:p_clusterlow -145.08665 22.10811 -6.563 5.31e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.5507593)
##
## Null deviance: 105761 on 137234 degrees of freedom
## Residual deviance: 75580 on 137229 degrees of freedom
## AIC: 307609
##
## Number of Fisher Scoring iterations: 2
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
#descr = gsub("[][!#$%()*,.:;<=>@^_|~.{}]", "", as.character(wine_projection$description))
corpus = Corpus(VectorSource(as.character(wine_projection$description)))
inspect(corpus[1:2])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 2
##
## [1] This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.
## [2] Ripe aromas of fig, blackberry and cassis are softened and sweetened by a slathering of oaky chocolate and vanilla. This is full, layered, intense and cushioned on the palate, with rich flavors of chocolaty black fruits and baking spices. A toasty, everlasting finish is heady but ideally balanced. Drink through 2023.
corpus = tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, stemDocument)
inspect(corpus[1:2])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 2
##
## [1] tremend 100 variet wine hail oakvill age three year oak juici redcherri fruit compel hint caramel greet palat frame eleg fine tannin subtl minti tone background balanc reward start finish year ahead develop nuanc enjoy 20222030
## [2] ripe aroma fig blackberri cassi soften sweeten slather oaki chocol vanilla full layer intens cushion palat rich flavor chocolati black fruit bake spice toasti everlast finish headi ideal balanc drink 2023
#meta(corpus, type="corpus")
corpus_dtm = DocumentTermMatrix(corpus)
corpus_lda = LDA(corpus_dtm, k = 2, control = list(seed = 1234))
corpus_documents = tidy(corpus_lda, matrix = "gamma")
wine_with_lda = data.frame(wine_projection, topic1 = corpus_documents$gamma[1:nrow(wine_projection)], topic2 = corpus_documents$gamma[(nrow(wine_projection)+1):nrow(corpus_documents)])
mod7 = glm(logit_points~inv_price*p_cluster + topic1, data=wine_with_lda)
plot(mod7)
summary(mod7)
##
## Call:
## glm(formula = logit_points ~ inv_price * p_cluster + topic1,
## data = wine_with_lda)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.2782 -0.3091 0.1025 0.4429 5.8383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.54811 0.10276 15.066 < 2e-16 ***
## inv_price -43.25620 0.99915 -43.293 < 2e-16 ***
## p_clusterhigh -0.86595 0.01772 -48.879 < 2e-16 ***
## p_clusterlow 0.91635 0.08617 10.634 < 2e-16 ***
## topic1 -1.43252 0.20317 -7.051 1.79e-12 ***
## inv_price:p_clusterhigh 31.04668 1.00284 30.959 < 2e-16 ***
## inv_price:p_clusterlow -144.08505 22.10465 -6.518 7.13e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.5505638)
##
## Null deviance: 105761 on 137234 degrees of freedom
## Residual deviance: 75553 on 137228 degrees of freedom
## AIC: 307562
##
## Number of Fisher Scoring iterations: 2
corpus_topics <- tidy(corpus_lda, matrix = "beta")
corpus_topics
## # A tibble: 65,304 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 100 4.664335e-04
## 2 2 100 7.077942e-04
## 3 1 20222030 1.119677e-06
## 4 2 20222030 6.400127e-07
## 5 1 age 2.741178e-03
## 6 2 age 4.352850e-03
## 7 1 ahead 8.223265e-05
## 8 2 ahead 3.273607e-05
## 9 1 background 3.161070e-04
## 10 2 background 1.449376e-04
## # ... with 65,294 more rows
corpus_top_terms <- corpus_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
corpus_top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
corpus_sentiments <- tidy(corpus_dtm) %>%
inner_join(get_sentiments("bing"), by = c(term = "word")) %>%
count(document, sentiment, wt = count) %>%
ungroup() %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
arrange(sentiment)
beta_spread <- corpus_topics %>%
mutate(topic = paste0("topic", topic)) %>%
spread(topic, beta) %>%
filter(topic1 > .001 | topic2 > .001) %>%
mutate(log_ratio = log2(topic2 / topic1))
beta_spread
## # A tibble: 270 x 4
## term topic1 topic2 log_ratio
## <chr> <dbl> <dbl> <dbl>
## 1 accent 0.0011319352 1.632377e-03 0.5281827
## 2 acid 0.0101020896 1.089604e-02 0.1091502
## 3 add 0.0015936485 6.447203e-04 -1.3055881
## 4 age 0.0027411779 4.352850e-03 0.6671645
## 5 alcohol 0.0007706188 2.302083e-03 1.5788504
## 6 almond 0.0015819072 2.658711e-04 -2.5728660
## 7 almost 0.0008838961 2.106729e-03 1.2530560
## 8 along 0.0013260939 1.152072e-03 -0.2029525
## 9 alongsid 0.0010304528 9.993428e-05 -3.3661550
## 10 also 0.0036764899 4.080261e-04 -3.1715957
## # ... with 260 more rows
ggplot(beta_spread[order(abs(beta_spread$log_ratio)),][1:30,], aes(x=term, y=log_ratio)) + geom_bar(stat="identity", fill="green", width=.2) + coord_flip() + ggtitle("Log ratio of beta in topic2/topic1")
wine_with_lda2 = data.frame(wine_with_lda, document = rownames(wine_with_lda))
wine_with_sentiment = merge(wine_with_lda2, corpus_sentiments[,c(1,4)], by = "document")
sample_bad_reviews = sample(wine_with_sentiment$description[wine_with_sentiment$sentiment == -5], size=3)
#review_sentiment = ap_sentiments[order(as.numeric(ap_sentiments$document)),]
#wine_with_lda = data.frame(wine_with_lda, review_sentiment = review_sentiment)
mod8 = glm(logit_points~inv_price*p_cluster + topic1 + sentiment, data=wine_with_sentiment)
plot(mod8)
summary(mod8)
##
## Call:
## glm(formula = logit_points ~ inv_price * p_cluster + topic1 +
## sentiment, data = wine_with_sentiment)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.2879 -0.3087 0.1003 0.4426 5.8321
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.673e+00 1.103e-01 15.176 < 2e-16 ***
## inv_price -4.399e+01 1.059e+00 -41.523 < 2e-16 ***
## p_clusterhigh -8.822e-01 1.881e-02 -46.903 < 2e-16 ***
## p_clusterlow 9.281e-01 8.986e-02 10.329 < 2e-16 ***
## topic1 -1.636e+00 2.180e-01 -7.505 6.19e-14 ***
## sentiment -2.138e-03 1.173e-03 -1.823 0.0683 .
## inv_price:p_clusterhigh 3.172e+01 1.063e+00 29.823 < 2e-16 ***
## inv_price:p_clusterlow -1.497e+02 2.309e+01 -6.484 8.99e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.5466062)
##
## Null deviance: 91243 on 118939 degrees of freedom
## Residual deviance: 65009 on 118932 degrees of freedom
## AIC: 265704
##
## Number of Fisher Scoring iterations: 2
mod9 = glm(logit_points~inv_price*p_cluster + sentiment, data=wine_with_sentiment)
plot(mod9)
summary(mod9)
##
## Call:
## glm(formula = logit_points ~ inv_price * p_cluster + sentiment,
## data = wine_with_sentiment)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.2920 -0.3143 0.0996 0.4426 5.8187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.570e-01 1.805e-02 47.476 < 2e-16 ***
## inv_price -4.404e+01 1.060e+00 -41.560 < 2e-16 ***
## p_clusterhigh -8.841e-01 1.881e-02 -46.997 < 2e-16 ***
## p_clusterlow 9.336e-01 8.988e-02 10.388 < 2e-16 ***
## sentiment -2.125e-03 1.173e-03 -1.812 0.07 .
## inv_price:p_clusterhigh 3.176e+01 1.064e+00 29.858 < 2e-16 ***
## inv_price:p_clusterlow -1.511e+02 2.309e+01 -6.544 6.01e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.5468604)
##
## Null deviance: 91243 on 118939 degrees of freedom
## Residual deviance: 65040 on 118933 degrees of freedom
## AIC: 265758
##
## Number of Fisher Scoring iterations: 2